Lecture 11: The Good-Turing Estimate

Authors

  • Ellis Weng
  • Andrew Owens
Abstract

In many language-related tasks, it would be extremely useful to know the probability that a sentence or word sequence will occur in a document. However, there is not enough data to account for all word sequences. N-gram models are therefore used to approximate the probability of word sequences. Making an independence assumption between the n-grams reduces some of the problems with data sparsity, but even n-gram models can have sparsity problems. For example, the Google corpus has 1 trillion words of running English text. It contains 13 million distinct words that occur more than 200 times, so there are at least 13 million × 13 million = 169 trillion potential bigrams, far more than the 1 trillion words in the corpus. Smoothing is a strategy used to account for this data sparsity. In this lecture, we will explore Good-Turing smoothing, a particular kind of smoothing.
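The abstract does not spell out the estimator, so the following is only a minimal sketch of the standard textbook form of Good-Turing, not necessarily the formulation used in the lecture. It assumes the usual definitions: N_r is the number of distinct types seen exactly r times, the mass reserved for unseen types is N_1 / N, and the adjusted count is r* = (r + 1) N_{r+1} / N_r. The function name `good_turing`, the toy token list, and the fallback to the raw count when N_{r+1} = 0 are illustrative assumptions; practical implementations usually smooth the N_r values instead.

```python
from collections import Counter


def good_turing(tokens):
    """Return (p_unseen, adjusted_counts) under the basic Good-Turing estimate.

    Hypothetical helper for illustration only.
    """
    counts = Counter(tokens)                 # raw count r of each observed type
    total = sum(counts.values())             # N, the number of observed tokens
    freq_of_freq = Counter(counts.values())  # N_r, number of types seen r times

    # Probability mass reserved for types never seen in the sample: N_1 / N.
    p_unseen = freq_of_freq.get(1, 0) / total

    adjusted = {}
    for r, n_r in freq_of_freq.items():
        n_next = freq_of_freq.get(r + 1, 0)
        # r* = (r + 1) * N_{r+1} / N_r.
        # Assumption: keep the raw count when N_{r+1} is zero.
        adjusted[r] = (r + 1) * n_next / n_r if n_next else float(r)
    return p_unseen, adjusted


# Sparsity arithmetic from the abstract: 13 million words squared.
vocab_size = 13_000_000
potential_bigrams = vocab_size ** 2

# Toy sample: "a" occurs 3 times, "b" and "c" twice, "d", "e", "f" once.
tokens = "a a a b b c c d e f".split()
p_unseen, adjusted = good_turing(tokens)
```

On this toy sample, three of the ten tokens belong to types seen once, so 0.3 of the probability mass is set aside for unseen types, and the count of each singleton is discounted from 1 to 4/3 × ... in the sense that r = 1 maps to (2 × 2) / 3.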


Similar resources

Lecture 11: Randomization and Complexity

Definition 1.1. A probabilistic Turing machine (PTM) M is a TM with two transition functions in place of the usual one: δ0 : Q × Γ → Q × Γ × {L, S, R} and δ1 : Q × Γ → Q × Γ × {L, S, R}. At each time step, independently, the algorithm chooses b from {0, 1} with equal probability 1/2 and makes its move using transition function δb. The running time of M is the maximum number of steps before M halts ...


Lecture 1: Course Overview and Turing machine complexity

1. Basic properties of Turing Machines (TMs), Circuits & Complexity
2. P, NP, NP-completeness, Cook-Levin Theorem
3. Hierarchy theorems, Circuit lower bounds
4. Space complexity: PSPACE, PSPACE-completeness, L, NL, Closure properties
5. Polynomial-time hierarchy
6. #P and counting problems
7. Randomized complexity
8. Circuit lower bounds
9. Interactive proofs & PCP Theorem Hardness of approxi...


Good-Turing Smoothing Without Tears

The performance of statistically based techniques for many tasks such as spelling correction, sense disambiguation, and translation is improved if one can estimate a probability for an object of interest which has not been seen before. Good-Turing methods are one means of estimating these probabilities for previously unseen objects. However, the use of Good-Turing methods requires a smoothing s...


Neural field modelling

The tools of dynamical systems theory are having an increasing impact on our understanding of patterns of neural activity. In these five lectures I will describe how to build tractable tissue level models that maintain a strong link with biophysical reality. These models typically take the form of nonlinear integro-differential equations. Their non-local nature has led to the development of a s...



Publication date: 2010